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Abstract 

An integral part of many algorithms for S-estimators of linear regression is random subsam- 
pling. For problems with only continuous predictors simple random subsampling is a reliable 
method to generate initial coefficient estimates that can then be further refined. For data 
with categorical predictors, however, random subsampling often does not work, thus limiting 
the use of an otherwise fine estimator. This also makes the choice of estimator for robust 
linear regression dependent on the type of predictors, which is an unnecessary nuisance in 
practice. For data with categorical predictors random subsampling often generates singular 
subsamples. Since these subsamples cannot be used to calculate coefficient estimates, they 
have to be discarded. This makes random subsampling slow, especially if some levels of cate- 
gorical predictors have low frequency, and renders the algorithms infeasible for such problems. 
This paper introduces an improved subsampling algorithm that only generates nonsingular 
subsamples. We call it nonsingular subsampling. For data with continuous variables it is as 
fast as simple random subsampling but much faster for data with categorical predictors. This 
is achieved by using a modified LU decomposition algorithm that combines the generation of 
a sample and the solving of the least squares problem. 

1 Introduction 

In a nutshell, a random subsampling based algorithm for S-estimates of linear regression does the 
following. It takes a random sample of the observations of size equal to the number of predictors p, 
resulting in a square design matrix, solves a "least squares" problem on this reduced data set (which 
gives residuals in this case) and refines the resulting parameter estimate using a redescending 
M-estimate of regression with a simultaneous scale on the whole data set. This is repeated for a 
pre-specified number of times. This final S-estimate is then taken to be the one that resulted in 
the smallest scale estimate. Definitions of these methods are given in Section [2j 

The random subsampling algorithms described above work well for continuous data. For cate- 
gorical data, however, they are often slow or fail completely. The problem lies in the generation of 
good subsamples, i.e., square subsamples that contain no collinearities. Otherwise the subsample 
has to be discarded, since it does not generate a well defined starting value. We illustrate the 
problem by means of a simple example. Imagine a simple one-way ANOVA with 3 groups and 3 
repetitions each. Using treatment contrasts to encode the grouping structure, we get the following 
least squares problem. 
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In this example, the size of the subsample is 3. It is clear that the coefficients /3 can only be 
estimated if at least one observation of each group is part of the subsample. From the total number 
of possible subsamples, 84, only 27 correspond to a nonsingular subsample. Therefore, even in this 
very simple example, the probability of discarding a subsample because of collinearities is about 
two thirds. This probability is much higher when some levels of factors have low frequencies. Then 
an excessively large number of subsamples is required, rendering the simple random subsampling 
algorithms unfeasible for such data sets. 

Instead of discarding the whole sample, we propose to solve this problem by dropping the ob- 
servation causing the singularity and continue sampling. We call this refined subsampling strategy 
no nsingular subsamplin g. 

Mar onna and Yohail J2000) have proposed another approach on this problem. They solve it 



using a combination of M and S-estimates, called M-S-estimates. The categorical part is estimated 
separately with M-estimates. The continuous part is estimated using S-estimates. Their approach 
works well for data that contains only purely categorical and continuous variables. For interactions 
of categorical and continuous variables, however, it is not clear how to split the data. Using the 
M-estimate for the interaction will result in a loss of robustness, while using the S-estimate will 
again produce singular subsamples. 

In the next section we introduce the notation and provide definitions for all methods used. 
Then we develop the basic algorithms for M-estimates and S-estimates. In Section 2] we explain 
the nonsingular subsampling algorithm. Finally, we conclude with Section [SJ 



2 Notation and definitions 

Consider the standard multiple linear regression model, 

Vi = xJ§_+Si, e ~ 7V(0,cr 2 ) , i=l,...,n, 

with ei i.i.d., independent of and /3 of length p. Throughout this text we assume that the design 
matrix X, combining all into one large matrix, is of full rank p. We denote the residuals as 

Simultaneous M-estimates of regression and scale are defined as the solutions and a to, 



(1) 



(2) 

where k is a tuning constant, y is a so-called p-function, and ip is a ^-function. A p-function, as 
defined in iMaronna et all (|2006l) . is assumed to be a nondecreasing function of \r\, with p(0) = 
and strictly increasing for r > where p(r) < p(oo). If a p- function is bounded, we assume p(oo) = 
1 and call it a redescending p-function. A ^-function is the derivative of a p-function and is usually 
standardized so that "0'(O) = I. A solution a to @ for a given vector r, a(r), is called M-estimate 
of scale. The solutions to (|TJ) for redescending p-functions are local minima of the corresponding 
optimization problem and are called redescending M-estimates of regression. 

S-estimates of regression are the parameter value /3 that minimizes the M-estimate of scale 
a = o-(r([3)) of the associated residuals, 

$ = argmincr(r (/?)) . (3) 
_ P 

\ ■ ■ maximal breakdown point (1— p/n)/2 of the S-estimate is attained when using k = (I— p/n)/2. 
Note that solutions of ([3]) are always also the solution to a simultaneous M-estimation of regression 
and scale prob l em. F or an introduction to M-estimation and details about S-estimates, we refer to 
Maronna et~a] <|2006h . 
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A L U decomposition of a matrix A is defined as the product of a unit lower-triangular matrix 
L and an upper-triangular matrix U. Such a decomposition does not always exist, therefore in 
practice one usually uses an LU decomposition with partial pivoting. This is an LU decomposition 
of the matrix A where the rows have been reordered by a permutation matrix P T . The ordering 
is chosen in a way that minimizes numerical errors. The decomposed matrix can then expressed 
as, 

A = PLU. 

This form is useful to solve systems of linear equations A/3 = y. Using the LU decomposition this 
problem can be solved directly using a forward and a backward elimination. 



3 Basic algorithms 

First we describe algorithms to compute M-estimates of regression and M-estimates of scale. The 
eq uations (U) and flU) a re preferably solved using iterative reweighting algorithms. As outlined 



Maronnaet all (| 20061 ). one can rewrite Q and © to take the form of a weighted least squares 



problem and a weighted sample variance. The weights then correspond to the respective robustness 
weights. Starting from some suitable initial estimate, the iterative reweighting algorithm then 
simply repeats the following two steps until convergence. The first step consists of the computation 
of the weights using the results from the previous iteration. In the second step the weighted problem 
is solved using the weights computed in the first step. Convergence of this algorithm is usually not 
a problem. For redescending M-estimates the algorithm converges to a local minimum. One can 
easily get an algorithm solving the simultaneous M-estimation of regression and scale problem by 
combining the two iterations. 

We now come to the computation of S-estimates. Combining the above mentioned properties of 
S-estimates and M-estimates, we can derive a simple algorithm that we will later use as the basis for 
further improvements. See Algorithm [T] for a summary of the rest of the paragraph. Firstly, note 
that since any S-estimate is also a solution to ([I}, we can restrict the minimization to parameter 
vectors j3 that solve ([T]) and © simultaneously. Secondly, we use the iterative reweighting algorithm 
described above to generate such solutions from a set of initial estimates. Such a set of initial 
estimates may be found by solving all subproblems involving only p observations. Finally, we get 
the desired S-estimate as the solution resulting in the smallest scale estimate. 

Since for subproblems with p observations the design matrix is square, the least squares problem 
reduces to a set of linear equations. It has only a defined solution if the design matrix is invertible, 
i.e., there are no collinearities. We may simply discard all singular subproblems. Note that the 
check for singularity does not require an additional step. If the system of linear equations of the 
subproblem can be solved we may continue, otherwise we discard the subsample. 

Data: y, X. 
Result: /3, a. 

1 O i — CO 

2 for all subsets J C {1, . . . , n} of size p do 
if design matrix Xj contains no collinearities then 

h <- x j% 

<r- Solve © and ([T]) using iterative reweighting algorithm starting from /3 . 
if a j < a then 



Algorithm 1: Basic algorithm for the computation of S-estimates. 



This algorithm using exhaustive resampling, i.e., running over all possible subproblems, as 
shown in Algorithm [TJ is of course only suitable for small problems. For large problems, it is 
neither feasible nor sensible to look at all the subsamples. Instead, it is enough to consider only 
set of random subsamples. Depending on the expected proportion of outliers, we can use simple 
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combinatorics to determine how many random sub-samples are required to select at least one outlier- 
free subsample with a probability of, say, 0.999. The number of subsamples grows exponentially 
with p. Taking 1000 ra ndom subsamples has proven to work well in practice. Exact numbers 
are given in Table 5.3 of iMaronna et all (|2006l) . T he same reference also summarizes many more 
optimizations. Worth mentioning is the paper bv ISalibian-Barrera and Yohai ( 20061 ) where they 
develop a complete strategy for computing S-estimates, dealing also with very large data sets. 



4 Nonsingular subsampling 

Simple random subsampling algorithms have the drawback that they cannot guarantee the gener- 
ation of nonsingular subsamples, i.e., without collinearities. They work on a simple trial-and-error 
basis. In the following, we propose an algorithm that produces only nonsingular subsamples. The 
algorithm we propose has the advantage that it is much faster than simple random subsampling 
algorithms for hard problems, without sacrificing any time for easy problems. 

The nonsingular subsampling algorithm merges the two steps of generating the random subsam- 
ple and solving the system of linear equations. The latter consists of computing a LU decomposition 
(for a definition, see Section[2]) and then solving two triangular linear systems of equations. Instead 
of generating the whole subsample at once, we propose to select observation by observation. A 
new observation is only added to the subsample if it is not collinear to the observations already 
selected. Proceeding in this way, we will always end up with a nonsingular subsample. The speed 
up is achieved by using a modified LU decomposition algorithm. It computes the LU decompo- 
sition observation by observation, without having to recompute any results of previously selected 
observations if one observation needs to be dropped. So if the random subsample is nonsingular in 
the first place, e.g., for continuous predictors, the algorithm does the same as the simple random 
subsampling algorithms. But for singular subsamples, we can avoid restarting from scratch, simply 
dropping the observation and continue with the next candidate is enough. 

The modified LU decomposition algorithm is based on the so -called Gaxpy variant of the LU 
decomposition algorithm as found in lGolub and Van Loan! (|l996l ). It is of the same complexity as 



other LU decomposition algorithms. The Gaxpy variant delays the computation of columns of U 
until they are actually needed. To compute the ith column of L, we need only columns 1 to i of 
U. In case a singularity is detected, it is therefore enough to only repeat the last step using a new 
observation / column. Results obtained prior to this step do not need to be recomputed. 
The nonsingular subsampling algorithm is shown in Algorithm [2] 

The numerical stability of the algorithm can be improved by using a matrix preconditioning 
technique that reduces the condition number of the design matrix. We propose to use a method 
called equilibration, described, e.g., in Demmell (1997). Tests have shown that it is enough to apply 



this method to the complete design matrix, even if only sub-matrices are used later on. 



5 Conclusions 

Current algorithms for S-estimates of linear regression have trouble coping with categorical predic- 
tors, especially if interactions of continuous and categorical predictors are involved. These issues 
can be avoided by using nonsingular subsampling instead of simple random subsampling. By 
merging the steps of generating the subsample and fitting the least squares problem, the new sub- 
sampling algorithm can generate nonsingular subsamples much more efficiently than simple random 
subsampling for data sets with categorical predictors. Comparing the runtimes of S-estimates us- 
ing nonsingular subsampling and M-S-estimates showed only a modest increase of computing time 
(around 10%) even for quite large designs (n = 8088, p — 340, 2 of them continuous predictors). 
The nonsingular subsampling algorithm is implemented in the lmrob function of the R package 
robustbase from version 0.9-3. 
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Data: n x p matrix X , response vector y, singularity treshold e. 

Result: Return code (0 for success, otherwise number of failing step), initial estimate (3. 
## Initialize variables, pivoting table p, selected subsample index vector s 

1 J7<-0;Z/<— I; p<- 1 : p; s_<- 1 : p; k <- 1 
## Permutate observations randomly 

2 (f- permd : n) 

3 A <" X L:p 

4 y^Kt 

5 for j in 1 to p do ## Find non-singular subsample and calculate LU decomposition 



n 

12 

13 
14 

15 
16 

17 

18 
19 

20 
21 
22 

23 
24 

25 
26 



if j == 1 then v 1:p <- A 1:p ^ 
else 

## (Forward) solve to get required column of U 



1,3 



-1-Al:j-X,fc 



Lj: Pt l:j-lUl : j-l l j 



if j < p then 

## Find pivot 
/i <— argmax^ 
if > e then 

## Subsample is still non-singular 



fe 



## Swap elements of v and rows of A 



Vj O IV 
## Update L 
## Swap rows of £ 

_ ^ L f*,l--j-l 

if lE/l < £ then 

## Singularity detected: skip this column and try again if possible 
if k < n then 
k <- k + 1 
GotoH 

else ## Return with an error 
|^ return j 



k <- k 



## Solve X s i- p y s and undo pivoting 

27 $^L- J U~ J y s 

28 for j in p — 2 to do 



29 return 0, f3 



Algorithm 2: Constrained subsampling using modified Gaxpy variant of LU decomposition. We 
use vector index notation to select subvectors and submatrices. The index vector 1 : p — 1 in 
Ei-p-i indicates that an operation only acts on the elements 1 to p — 1 of the vector. 
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